Abstract
Background: Hematopoietic cell transplantation (HCT) for sickle cell disease (SCD) has excellent overall outcomes, but heterogeneous inter-patient outcomes are a barrier to informed shared decision making. We previously created Sickle Predicting Outcomes of Hematopoietic Cell Transplantation (SPRIGHT), an individualized machine learning (ML) predictive model, and performed extensive internal cross-validation with promising results (Chandrashekhar et al, JMIR AI 2025). Adoption of clinical prediction models is limited by concerns about performance reliability and generalizability outside the development setting, and external validation remains a critical step to ensure that a model performs consistently across populations. Temporal validation is a component of external validation that uses future-era data to assess robustness to evolving clinical practices and patient profiles. Additionally, data-sharing requirements, computational setup, and a lack of accessible tools limit routine external validation and hinder model reproducibility and integration into collaborative research and clinical workflows.
Objective: To create and demonstrate a novel, shareable, browser-based validation framework built with open-source tools; to conduct temporal external validation of the SPRIGHT model by evaluating its predictive performance in a temporally distinct cohort; and to provide a platform for future temporal and geographic validations.
Method: We implemented the SPRIGHT model as a web application using Gradio, an open-source Python library for building simple, interactive interfaces for ML models. The application was deployed on Hugging Face, a widely used open-source platform for collaborative ML research and application sharing, allowing external users to input new data and obtain performance metrics without requiring access to the original model code or dataset. The application displays discrimination metrics, including accuracy, balanced accuracy, recall, precision, and area under the receiver operating characteristic curve (AUC-ROC), and calibration metrics, including calibration slope, intercept, and calibration curves.
The SPRIGHT model was originally trained on 1,641 SCD HCT cases from the Center for International Blood and Marrow Transplant Research (CIBMTR) datasets spanning 1991–2020. For external temporal validation, we used an independent cohort of 286 patients who underwent HCT between 2021 and 2023, with follow-up data extending into 2025. The outcomes evaluated were overall survival (OS), event-free survival (EFS), graft failure (GF), acute graft-versus-host disease (aGVHD), and chronic GVHD (cGVHD).
Results: The model showed moderate performance, with AUCs of 0.63 (OS), 0.65 (EFS), 0.63 (GF), 0.66 (aGVHD), and 0.66 (cGVHD). Notably, performance declined across these metrics relative to internal validation, suggesting early signs of model degradation. Chi-square contingency tests identified data drift in key clinical variables between the pre- and post-2020 cohorts, including a significant increase in the proportion of HLA-mismatched related (haploidentical) donors in the post-2020 dataset (p<0.001), along with improved EFS after haploidentical HCT (p<0.01). Because data and concept drift can undermine model reliability and patient safety, we retrained the model by incorporating post-2021 data while retaining the original hyperparameters. The updated models maintained strong discriminative and calibration performance, with AUCs of 0.74 (OS), 0.76 (EFS), 0.73 (GF), 0.66 (aGVHD), and 0.72 (cGVHD). Calibration curves and slope/intercept values indicated good calibration in the <0.5 probability range, demonstrating generalizability and utility across evolving patient populations.
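The drift check described above can be sketched with SciPy's chi-square contingency test. The 2×3 count table below is invented for illustration (row totals mirror the reported cohort sizes, and the donor-type categories and counts are hypothetical), not actual CIBMTR counts.

```python
# Illustrative chi-square contingency test for drift in a categorical
# variable (e.g., donor type) between eras. Counts are made up.
from scipy.stats import chi2_contingency

# Rows: 1991-2020 era, 2021-2023 era.
# Columns: hypothetical donor types (matched sibling, matched
# unrelated, haploidentical).
table = [
    [900, 500, 241],   # pre-2021 era (n=1,641)
    [120,  60, 106],   # post-2020 era (n=286)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")
```

A small p-value here indicates that the category distribution differs between eras, the same evidence of drift that motivated retraining the model on the newer data.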
Conclusion: This work introduces a novel, browser-accessible external validation framework that avoids logistical barriers such as infrastructure setup and data transfer, supporting reproducible and scalable model validation across centers. It also demonstrates that SPRIGHT maintains acceptable performance after retraining with new data and reinforces the need to monitor and update predictive models as the clinical landscape evolves. Taken together, individual-level risk prediction and continuous, group-level model monitoring, evaluation, and retraining support clinical utility and generalizability.